Glint-1.3 Is Live And It Is Just A Transformer Doing Its Best

Glint-1.3 is live. It has 982,656 parameters. It was trained on 100 billion tokens from FineWeb-Edu. It runs at 138,562 tokens per second on an RTX 5090. It is shy. It might output chuamliamce. If it does, try again. That is the model. That is the summary. That is the vibe.

Sometimes the best models are the ones that admit they are trying. Glint-1.3 is trying. It is also under one million parameters. Both facts are important.

The Scaling-Down Philosophy

We spent months adding features. SPIN. DPO. SleepGate. Retention gates. Recurrent loops. LoRA. Engrams. More parameters. More tricks. More complexity. And you know what? The features were hurting the models. The tiny models could not breathe. So we are doing the opposite now. Scaling down. Strip everything. Pure Llama. See how far simplicity goes.

Glint-1.3 is that experiment. Approximately one million parameters. No gimmicks. Just a transformer doing its best. It is the first model in the CompactAI scaling-down plan. It is also the simplest model we have released in a long time. Simplicity feels strange after so much complexity. It also feels correct.

Parameters: 982K
Training Tokens: 100B
Tokens Per Second: 138K
Context Window: 256

The Journey

The model improves monotonically over 95,000 training steps. Wikitext-2 cross-entropy loss drops from 4.29 to 3.08. For a one-million-parameter model, this is actually respectable. The loss curve does not spike. The validation metrics do not collapse. The model just learns. Slowly. Steadily. Predictably.
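Loss numbers are easier to feel as perplexity. The sketch below is a rough conversion, assuming the reported losses are per-token averages in nats (the post does not say). It also prints the loss of a uniform guess over a 500-token vocabulary as a reference point. The function name is illustrative.

```python
import math

def perplexity(ce_loss_nats: float) -> float:
    """Convert an average per-token cross-entropy (in nats) to perplexity."""
    return math.exp(ce_loss_nats)

# Loss values quoted in the text above.
start_loss, end_loss = 4.29, 3.08

# Assumption: a uniform guess over a 500-token vocabulary as a baseline.
uniform_loss = math.log(500)

for name, loss in [("uniform", uniform_loss), ("start", start_loss), ("end", end_loss)]:
    print(f"{name:>8}: loss={loss:.2f}  perplexity={perplexity(loss):.1f}")
```

Under that assumption the model goes from choosing among roughly 73 equally likely tokens to roughly 22.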

That predictability is the point. Complex architectures introduce variables that obscure the signal. Simple architectures expose the signal clearly. Glint-1.3 exposes the signal. The signal says that tiny models can learn. The signal says that simplicity works. The signal says that we were overthinking it.

Model Specifications

    # Glint-1.3 architecture summary
    Architecture: Transformer Decoder (Llama-style)
    Parameters: 982,656
    Hidden Dim: 128
    Layers: 4
    Attention Heads: 4
    KV Heads: 4 (GQA; one KV head per query head, so equivalent to standard multi-head attention)
    MLP Intermediate: 384 (SwiGLU)
    Context Length: 256 tokens
    Vocab Size: 500 (ByteLevel BPE)
    Normalization: RMSNorm
    Position Encoding: RoPE
    Embeddings: Tied input/output
    # Simple. Standard. Boring. Effective.
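A back-of-the-envelope parameter count can be sketched from the summary. This is not the CompactAI code. It assumes a bias-free Llama-style block, two RMSNorm vectors per layer, one final norm, and no parameters for RoPE. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlintConfig:
    # Values copied from the architecture summary.
    hidden_dim: int = 128
    n_layers: int = 4
    n_heads: int = 4
    n_kv_heads: int = 4
    mlp_dim: int = 384
    vocab_size: int = 500
    tied_embeddings: bool = True

def count_parameters(cfg: GlintConfig) -> int:
    """Rough weight count for a bias-free Llama-style decoder (an assumption)."""
    head_dim = cfg.hidden_dim // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Attention: Q and O are hidden x hidden; K and V shrink with fewer KV heads.
    attn = 2 * cfg.hidden_dim * cfg.hidden_dim + 2 * cfg.hidden_dim * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.hidden_dim * cfg.mlp_dim
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden_dim
    per_layer = attn + mlp + norms
    embed = cfg.vocab_size * cfg.hidden_dim
    if not cfg.tied_embeddings:
        embed *= 2  # separate output projection
    final_norm = cfg.hidden_dim
    return cfg.n_layers * per_layer + embed + final_norm

print(count_parameters(GlintConfig()))
```

Under these assumptions the count comes to 917,120, a little under the reported 982,656. The summary does not itemize every tensor, so the remaining 65,536 weights are not accounted for by this sketch.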

The Benchmarks

All checkpoints were evaluated on Wikitext-2, BLiMP for grammaticality, and ARC-Easy for science QA. Scoring uses the sliding-window log-prob methodology from the CompactAI benchmark suite. The results show steady improvement across training steps.

Metric	Best Checkpoint	Score
Wikitext-2 CE Loss	Step 95,000	3.06
BLiMP Accuracy	Step 11,500	64.2%
ARC-Easy Accuracy	Step 55,500	32.5%
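Sliding-window scoring matters when the text is longer than a 256-token context. The sketch below shows one common way to do it, not the CompactAI benchmark suite itself. The stride of 128 and the `log_prob` callable are assumptions; `log_prob` is a hypothetical stand-in for the model.

```python
import math
from typing import Callable, Iterator, List, Tuple

def sliding_windows(n_tokens: int, max_len: int = 256, stride: int = 128
                    ) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, score_from): feed tokens[begin:end] to the model,
    but only score positions score_from..end-1 so each token counts once."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in 1..max_len")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break

def mean_cross_entropy(tokens: List[int],
                       log_prob: Callable[[List[int], int], float],
                       max_len: int = 256, stride: int = 128) -> float:
    """Average negative log-probability per token, in nats.

    log_prob(context, target) is a hypothetical model interface that
    returns log p(target | context).
    """
    total, count = 0.0, 0
    for begin, end, score_from in sliding_windows(len(tokens), max_len, stride):
        for pos in range(score_from, end):
            total -= log_prob(tokens[begin:pos], tokens[pos])
            count += 1
    if count == 0:
        raise ValueError("no tokens to score")
    return total / count

# Toy stand-in model: uniform over a 500-token vocabulary.
def uniform(context: List[int], target: int) -> float:
    return -math.log(500)

tokens = [i % 500 for i in range(600)]
print(list(sliding_windows(len(tokens))))
print(mean_cross_entropy(tokens, uniform))
```

Overlapping windows give later tokens more context than chopping the text into disjoint 256-token chunks would. A smaller stride means more context per scored token and more compute.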

We also merged the best checkpoints per benchmark via per-parameter-group SLERP (spherical linear interpolation). The merged model achieves superadditive BLiMP gains. It exceeds the individual checkpoints' best scores on certain metrics. Weight averaging works. Model soups work. Simplicity enables these techniques. Complexity obscures them.
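For readers who have not met SLERP, here is a minimal sketch of the idea. It merges two checkpoints only, with one interpolation factor per parameter group. The group prefixes, the factors, and the `merge_checkpoints` helper are hypothetical. How the actual merge combined its checkpoints is not described in this post.

```python
import math
from typing import Dict, List

Vector = List[float]

def slerp(a: Vector, b: Vector, t: float, eps: float = 1e-8) -> Vector:
    """Spherical linear interpolation between two flattened weight tensors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        # Degenerate case: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel vectors: SLERP reduces to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

def merge_checkpoints(ckpt_a: Dict[str, Vector], ckpt_b: Dict[str, Vector],
                      group_t: Dict[str, float], default_t: float = 0.5
                      ) -> Dict[str, Vector]:
    """Merge two checkpoints, picking t by the first matching name prefix."""
    merged = {}
    for name, weights in ckpt_a.items():
        t = next((v for prefix, v in group_t.items() if name.startswith(prefix)),
                 default_t)
        merged[name] = slerp(weights, ckpt_b[name], t)
    return merged

# Toy checkpoints with two parameter groups.
a = {"attn.q": [1.0, 0.0], "mlp.up": [2.0, 0.0]}
b = {"attn.q": [0.0, 1.0], "mlp.up": [0.0, 2.0]}
print(merge_checkpoints(a, b, {"attn": 0.5, "mlp": 0.0}))
```

With t = 0 a group keeps checkpoint A's weights, and with t = 1 it takes checkpoint B's. Values in between follow the arc between the two weight directions instead of the straight line that plain averaging would take.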

Training Details

The model was trained on FineWeb-Edu sample-10BT. Batch size was 4,096 with a gradient accumulation factor of one. Sequence length was 256. Learning rate was 8e-4 with cosine decay and a 200-step warmup. Weight decay was 0.05. Max gradient norm was 0.5. Optimizer was AdamW with fused kernels. Precision was bfloat16. Hardware was an NVIDIA RTX 5090 throughout. Training time was approximately 30 hours for 95,000 steps.
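The numbers above can be sanity-checked in a few lines. This sketch assumes the batch size is counted in sequences, the warmup is linear from zero, and the cosine decay ends at zero on the final step. None of those three details are stated in the post. The function names are illustrative.

```python
import math

# Hyperparameters quoted in the training details.
BATCH_SIZE = 4_096      # assumed to be counted in sequences, not tokens
SEQ_LEN = 256
TOTAL_STEPS = 95_000
PEAK_LR = 8e-4
WARMUP_STEPS = 200
MIN_LR = 0.0            # assumption: the post does not state a floor

def tokens_seen(step: int) -> int:
    """Tokens processed after `step` optimizer steps."""
    return step * BATCH_SIZE * SEQ_LEN

def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(f"total tokens: {tokens_seen(TOTAL_STEPS) / 1e9:.1f}B")
for step in (0, 100, 200, 47_600, 95_000):
    print(step, f"{lr_at(step):.2e}")
```

With the batch counted in sequences, 95,000 steps come to about 99.6 billion tokens, which lines up with the 100 billion figure.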

These details matter because they are reproducible. Because they are simple. Because they do not require a cluster. Because they prove that tiny model research can happen on consumer hardware. That is the CompactAI promise. That is the Glint-1.3 demonstration.

Limitations

The context window is 256 tokens. That severely limits the model's ability to track long-range dependencies. Its knowledge is extremely limited due to parameter constraints. Coherence may fade after a few sentences. Repetition tends to emerge at higher temperatures. The model is not reliable enough for production applications. The purpose is research, education, and architectural experimentation.
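Two of these limits show up directly in a sampling loop: the prompt has to fit in 256 tokens, and temperature reshapes the next-token distribution. The sketch below is generic sampling code, not the Model Runner's implementation. All function names are hypothetical. Which temperature range suits Glint-1.3 is left to experiment.

```python
import math
import random
from typing import List, Optional

CONTEXT_LEN = 256  # from the model specifications

def truncate_context(token_ids: List[int], max_len: int = CONTEXT_LEN) -> List[int]:
    """Keep only the most recent tokens that fit in the context window."""
    return token_ids[-max_len:]

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits: List[float], temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
for temp in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Anything older than the last 256 tokens is simply dropped before the model sees it, which is one concrete reason coherence fades on longer outputs.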

These limitations are features. They define the scope. They set expectations. They remind us that tiny models are tiny. They also remind us that tiny models can still learn. Both reminders are valuable.

How To Use It

You can try Glint-1.3 in the CompactAI Model Runner space on Hugging Face. You can download the weights. You can run inference locally. You can experiment. You can break it. You can tell us what you find. That is how open source works. That is how research progresses. That is how tiny models improve.

Try Glint-1.3: https://huggingface.co/spaces/CompactAI-O/CompactAIModelRunner

The model is experimental. It might output chuamliamce. If it does, try again. It is shy. It is also under one million parameters. That is the trade-off. That is the charm.

Final Thoughts

Glint-1.3 is live. It is simple. It is tiny. It is trying. It runs fast. It learns steadily. It outputs chuamliamce sometimes. If it does, try again. That is the model. That is the summary. That is the vibe.

We are scaling down. We are stripping complexity. We are seeing how far simplicity goes. Glint-1.3 is the first step. More steps will follow. The journey continues. The progress is weird. The simplicity is refreshing.

Thank you for watching the experiment. Thank you for trying the model. Thank you for accepting that tiny models are tiny. Thank you for believing that tiny models can still matter. Both beliefs are valid. Both beliefs are necessary.